HTML Tag Based Metrics for use in Web Page Type Classification

Authors

Jonathan Elsas

Miles Efron

Abstract

Traditional machine learning classifications of HTML documents focus on features drawn from terms in the documents, the link structure of groups of documents, or a combination of both. These techniques attempt to generate topical classifications of documents, in the hope of mirroring a human's classification of pages into subject areas and thus facilitating retrieval. This paper presents an alternative method that aims to generate a "type-wise" classification of HTML documents. The types explored in this paper include tables, indexes, tables of contents, and textual content pages. These page types are of particular significance to the classification of documents on statistical web sites, which is one goal of the GovStat Project (http://www.ils.unc.edu/govstat), but they are also relevant to HTML document collections at large.
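
The abstract does not list the specific metrics the paper uses, so the following is only a minimal sketch of what tag-based features for page-type classification might look like. The function name `tag_metrics` and the three ratios it returns are hypothetical choices made for illustration, not the authors' metrics. It uses only Python's standard `html.parser` module.

```python
from collections import Counter
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts opening tags and the characters of text data in a page."""

    def __init__(self):
        super().__init__()
        self.tags = Counter()
        self.text_chars = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lower-cased from HTMLParser.
        self.tags[tag] += 1

    def handle_data(self, data):
        # Ignore surrounding whitespace so indentation does not inflate the count.
        self.text_chars += len(data.strip())


def tag_metrics(html):
    """Return illustrative (hypothetical) tag-based features for one page."""
    parser = TagCounter()
    parser.feed(html)
    parser.close()
    total = sum(parser.tags.values()) or 1  # avoid division by zero on empty input
    table_tags = sum(parser.tags[t] for t in ("table", "tr", "td", "th"))
    return {
        # High for pages dominated by tabular markup.
        "table_tag_ratio": table_tags / total,
        # High for index-like or table-of-contents-like pages.
        "link_tag_ratio": parser.tags["a"] / total,
        # High for textual content pages.
        "text_chars_per_tag": parser.text_chars / total,
    }


if __name__ == "__main__":
    print(tag_metrics("<table><tr><td>1</td><td>2</td></tr></table>"))
```

Feature vectors of this kind could be passed to any standard classifier; which metrics and which classifier work well for the page types named above is a question the full paper addresses, not this sketch.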

Similar resources

An Improved Optimized Web Page Classification using Firefly Algorithm with NB Classifier (WPCNB)

The web is a huge repository of information, which creates a need for accurate automated classifiers for Web pages to maintain Web directories and to increase search engines' performance. In the web page classification problem, each term in each HTML/XML tag of each Web page can be taken as a feature, an efficient methods to select best features to reduce feature space of the Web page classification problem d...

A fast HTML web page change detection approach based on hashing and reducing the number of similarity computations

This paper describes a fast HTML Web page detection approach that saves computation time by limiting the similarity computations between two versions of a Web page to nodes having the same HTML tag type, and by hashing the web page in order to provide direct access to node information. This efficient approach is suitable as a client application and for implementing server applications that coul...

Enhanced Information Retrieval by Using HTML Tags

Whenever digital libraries or knowledge management systems are to be automatically filled with web pages from the internet, document classification of the web pages is one of the major challenges. We present an approach which uses HTML tags in order to improve the quality of the hypertext document classification. Our approach uses weighting of HTML tags for separating relevant information in hy...

Feature Weighting Improvement of Web Text Categorization Based on Particle Swarm Optimization Algorithm

Some structures, such as the title, can often express the main content of a text, and these structures may influence the effectiveness of text categorization. However, the most common feature weighting algorithm, term frequency-inverse document frequency (TF-IDF), does not take the structural information of texts into account. To solve this problem, a new feature weighting al...

An Approach to Content Extraction from Scientific Articles using Case-Based Reasoning

In this paper, we present an efficient approach for content extraction of scientific papers from web pages. The approach uses an artificial intelligence method, Case-Based Reasoning (CBR), that relies on the idea that similar problems have similar solutions and hence reuses past experiences to solve new problems or tasks. The key task of content extraction is the classification of HTML tag seque...

Publication date: 2004
